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ABSTRACT 

Motivation: Eugene Myers in his string graph paper JMversl . [2005h 
suggested that in a string graph or equivalents a unitig graph, any 
path spells a valid assembly. As a string/unitig graph also encodes 
every valid assembly of reads, such a graph, provided that it can 
be constructed correctly, is in fact a lossless representation of 
reads. In principle, every analysis based on whole-genome shotgun 
sequencing (WGS) data, such as SNP and insertion/deletion (INDEL) 
calling, can also be achieved with unitigs. 

Results: To explore the feasibility of using de novo assembly in the 
context of resequencing, we developed a de novo assembler, fermi, 
that assembles lllumina short reads into unitigs while preserving 
most of information of the input reads. SNPs and INDELs can 
be called by mapping the unitigs against a reference genome. 
By applying the method on 35-fold human resequencing data, we 
showed that in comparison to the standard pipeline, our approach 
yields similar accuracy for SNP calling and better results for INDEL 
calling. It has higher sensitivity than other de novo assembly based 
methods for variant calling. Our work suggests that variant calling 
with de novo assembly can be a beneficial complement to the 
standard variant calling pipeline for whole-genome resequencing. In 
the methodological aspects, we proposed FMD-index for forward- 
backward extension of DNA sequences, a fast algorithm for finding all 
super-maximal exact matches and one-pass construction of unitigs 
from an FMD-index. 
Availability: http://github.com/lh3/fermi 
Contact: hengli@broadinstitute.org 

1 INTRODUCTION 

The rapidly decreasing sequencing cost has enabled whole-genome 
shotgun (WGS) resequencing at an affordable price. Many software 
packages have been developed to call variants, including SNPs, 
short insertions and deletions (INDELs) and structural variations 
(SVs), from WGS data. At present, the standard approach to 
variant calling is to map raw sequence reads against a reference 
genome and then to detect differences from the reference. It is well 
established and has be en proved to work from a single sample t o 
thousands of samples JlOOO Genomes Project Consortium!, l2010h . 
Nonetheless, a fundamental flaw in this mapping based approach 
is that mapping algorithms ignore the correlation between 
sequence reads. They are unable to take full advantage of 
data and may produce inconsistent outputs which complicate 
variant calling. This flaw has gradually attracted the attention 
of various research groups who subsequently proposed several 
methods to allev i ate t h e effect, including p ost alignment 
filtering JLi et all |2008| ; lOssowski etafl I2008I : iKrawitz et all 
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uOlOf) . iterati ve mapping (Manske and Kwiatkowski, 2009J), read 
realignment I AlbersetalJ, |20 1 Or Homer and Nelson . 2010t Li , 
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iDepristo et all 1201 If) and local assembly llCarnevali et al , 
However, because these methods still rely on the initial 
mapping, it is difficult for them to identify and recover mismapped 
or unmapped reads due to high sequence divergence, long insertions, 
SVs, copy number changes or misassemblies of the reference 
genome. They have not solved the problem from the root. 

Another distinct approach to variant calling that fundamentally 
avoids the flaw of the mapping based approach is to assemble 
sequence reads into contigs and to discover variants via assembly- 
to-assemby alignment. It was probably more widely used in the era 
of capillary sequencing. The assembly based method became less 
used since 2008 due to the great difficulties in assembling 25bp 
reads, but with longer paired-end reads and improved methodology, 
de novo assembly is reborn as the preferred choice for variant 
discovery between small genomes. 

For variant discovery between human genomes, however, 
the assembly based approach has not attracted much attention. 
Assembling a human genome is far more challenging than 
assembling a bacterial genome, firstly due to the sheer size of 
the genome, secondly to the rich repeats and thirdly due to 
the diploidy of the human genome. Many heuristics effective 
for assembling small genomes are not directly applicable to the 
human genome assembly. As a result, only a few de novo 
assemblers ha ve been applied on human short-rea d data. Amon 
them, AB ySS Islmpson et alll2009h. S OAPdenovo JLi et alll201 
and SGA JSimpson and Durbiiil2012r) . as of now, do not explicitly 
output heterozygotes. Although in theory it is possible to recover 
heterozygotes from their intermediate output, it may be difficult 
in practice as the assemblers m ay not dist i nguish heterozygotes 
from sequencing errors. Cortex (llqbal et all 120121) is specifically 
designed for retaining heterozygous variants in an assembly, but 
it ma y be missing heterozygotes. ALLPATHS-LG dGnerre et all 
1201 If) also paid particular attention to keep heterozygotes, but 
it still has relatively a low sensitivity. In addition, ALLPATHS- 
LG only works with reads from libraries with distinct insert size 
distributions and prefers read pairs with mean insert size below 
three times of the read length, while many resequencing projects do 
not meet these requirements and thus ALLPATHS-LG may not be 
applied or work to the best performance. Even if we also include 
de novo assemblers developed for capillary sequence reads, the 
version of the Celera ass embler used for assembling the HuRef 
genome JLevv et alll2007T) is the only one that retains heterozygotes 
while capable of assembling a mammalian genome. At last, one 
may think to map sequence reads back to the assembled contigs to 
recover heterozygous events, but this procedure will be affected by 
the same flaw of read mapping. To the best of our knowledge, no 
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existing de novo assemblers are able to achieve the sensitivity of the 
standard mapping based approach for a diploid mammalian genome. 
In this article, we will show the first time that the assembly based 
variant calling can achieve a SNP accuracy close to the standard 
mapping approach and have particular streng th in INDEL calling, 
confirming previous studies (Iqbal et al., 2012). In addition, the de 
novo assembly algorithm, fermi, developed for this practice is also 
a capable assembler for human assembly. 



2 METHODS 

The methods section is organized as follows. We first review the history of 
de novo assembly in the theoretical aspects, which leads to the rationale 
behind fermi: to use unitigs as a lossless representation of reads. We then 
summarize the notations used in the article and introduce bidirectional FM- 
index for DNA sequences. We will present several algorithms for assembling 
using t he bidirectional FM-i n dex. T he key algorithm is based on previous 
works ISimpson and Durbinl 120101) . but we need to adapt it to our new 
index. We also remove the recursion in the original algorithm. Finally we 
will discuss practical concerns in implementation. 

2.1 Theoretical background 

2.1.1 A history of the overlap-layout-consensus paradigm 
Compu ter assi s ted s e quence assemb l y can be dated back to the late 
1970s Istadenl 1 19791 : lijingeras et all fl979l) . In 1984, IPeltola et all first 
formulated the DNA assembling problem as finding the shortest string (the 
assembly) such that each sequence read can be mapped to the assembly 
within a required error rate. To solve the problem, they proposed a three- 
step procedure, which is essentially the overlap-layout-consensus (OLC) 
a pproach . 

iMversI (199511 pointed out that reducing DNA assembly to a shortest 
string problem is flawed in the presence of repeat. He (see also 
iKececioglu and Mvera 19950 further proposed the concept of overlap graph, 
where a vertex corresponds to a read and a bidirectional edge to an overlap. 
Naively, the DNA assembling problem can be cast as finding a path in the 
overlap graph such that each vertex/read is visited exactly once (though 
edge/overlap caused by repeats are not required to be traversed), equivalent 
to a Hamilton path problem which is known to be NP-complete. This has led 
many to believe that the OLC approach is theoretically crippled. 

However, this is a misbelief. Although the assembly problem can be 
reduced to a Hamilton path problem, it can be reduced to other problems 
as well and in practice almost no assemblers try to solve a Hamilton path 
problem. We note that a fundamental difference between a generic graph and 
an overlap graph is that the latter can be transitively reduced while retaining 
the read relationship. More formally, if Vi — > V2, l>2 ~ ► "3 an <i "i — > V3 
are all present, edge vi —> t>3 is said to be reducible. When we removed 
all the contained reads and reducible edges, a procedure called transitive 
reducti on, the result ing graph is still a loyal representation of the overlap 
graph IMversi 1 1 9951) . but the path corresponding to the assembly is not a 
Hamilton path any more because reads from repetitive regions need to be 
traversed multiple times. 

In a transitively reduced graph, if there exists v\ — > V2 with the out- 
degree of i'i and in-degree of no both equal to 1, we are able to merge v% 
and V2 into one vertex without altering the topology of the graph. After we 
performed all possible merges, we get a unitig graph in which each vertex 
corresponds to a unitig, representing a maximal linear sequence that can be 
resolved by reads. Multiple copies of a repeat may be collapsed to a single 
unitig. The concept of unitig helps to greatly sim plify an assembly g raph. It 
has played a central role in the Celera assembler JMyers et a l., 2000). 

On the other hand, introducin g unitigs has not theoretically solved the 
se quence assemb l y to the end. IMversI 1200511 . based on the suggestion 
by IPevzner et al.l J2001D . proposed to compute a traversal count for each 
edge and to remove false overlap edges by solving a minimum cost network 



flow problem. Finding the optimal assembly in the resultant graph can 
be reduced to a Eulerian path problem, which has a linear time solution. 
Myers origi nally applied this procedure to string graph, an equivalence to 
unitig graph. Medv edev and Brudnol {20090 pointed out that determining the 
traversal count can also be resolved directly in the bidirectional unitig graph 
using a bidirected network flow-based algorithm. 

Once we get a transitively reduced graph, the subsequent steps can be 
achieved in roughly linear time most of time - the worst case almost never 
happens globally in practice. However, deriving an overlap graph takes 
0{N' 2 ) time, where N is the number of reads, and transitive reduction takes 
at least O(E) time, where E is the number of edges which is usually much 
larger than N . This still makes an OLC based approach less favorable in 
short-read assembly where N c an be of the order of 10 9 . 

A breakthrough achieved by ISimpson and Durbinl J20101) finally solved 
this last remaining problem at least when we only consider exact overlaps. 
These authors developed an O(N) algorithm to find all the irreducible edges, 
effectively replacing the overlapping and transitive reduction phases. 

In summary, in the OLC paradigm, the sequence assembly problem can 
be practically solved in a time roughly lineal' in the total length of reads. 

2.1.2 De Bruijn graph and read coherence De Bruijn graph is an 
altern ative graph representation of sequence reads dldurv and Watermanl 
Il995l) . It can be trivially constructed with a simple linear-time algorithm. 
This makes the de Bruijn graph approach very attractive for assembling 
many short reads. 

However, de Bruijn is 'lossy'. From a theoretical point view, a de 
Bruijn graph is equivalent to an overlap graph built by splitting a long read 
into overlap fc-mers and requiring (fc-l)-mer exact overlaps between non- 
redundant fc-mers. Such a graph does not have transitive edges. Because 
long reads all effectively work as fc-bp reads in a de Bruijn graph, long- 
range information is lost. As a result, a path in the graph may be invalidated 
by reads. In contrast, in a unitig graph or equiva lently a string graph each 
path models a valid assembly from input reads. Myers (2005) called this 
property of path consistency as read coherence. 

Losing long-range information in reads, a de Bruijn graph by itself 
has reduced power to resolve short repeats. This flaw is usually amended 
by mapping reads back to the graph and bisecting repeats shorter than 
the reads, a procedure some called as read thread ing. Many de Buijn 
graph based assemblers essentially take this strategy ([Pevzner et all 1200 It 
IChaisson et aill2009l:IZerbino et allE009llLi et alll2010l) . though they may 
use different terminologies. 

With read threading, it is theoretically possible to transform a de Bruijn 
graph to a coherent graph, but in practice, threading is not straightforward 
and may be inefficient given complex repeat structures. For a coherent 
representation of reads, a unitig graph is simpler to construct. 

2.1.3 Concluding remark We noted that we only focused on the 
theoretical aspects of de novo assembly. In practice, many assemblers 
derived the final assembly by applying heuristics on the simplified graph 
instead of solving a network flow problem or a Eulerian problem. 
Furthermore, correcting errors, utilizing read pairs and c ontrolling memory 
usage all pose challenges to large-scale de novo assembly. Many practical 
problems are not solved perfectly. De novo assembly is still a field under 
active development. 



2.2 Rationale 

Being coherent, a perfectly constructed unitig graph annotated with per- 
unitig read counts in fact encapsulates all the information of reads and 
encodes no information invalidated by reads. In this sense, any unitig based 
analysis has an equivalent read based analysis, and vice versa. This article 
just uses this property to explore the applications for which we usually rely 
on reads. 
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Table 1. Notations 



Symbol Description 



C{a) + 0(a,l\P)-l) 
C{a) + 0(a,I u (P))-l 



(1) 
(2) 



T String: T = aoa± . . . a n _i with a„_i = $ 

\T\ Length of T including sentinels: \T\ = n 

T[i] The i-th symbol in string T: T[i] = a; 

T[i,j] Substring: T[i,j] = a; . . . aj 

Ti Suffix: Ti = T[i, n - 1] 

S Suffix array; S(i) is the position of the i-th smallest suffix 

B BWT: B[i] = T[S(i) - 1] if S(i) > or B[i] = $ otherwise 

C(a) Accumul. count array: C(a) = |{0 < i < n — 1 : T[i] < a}| 

0(a, i) Occurrence array: 0(a, i) = \{0 < j < i : B[j] = a}\ 

P oW String concatenation of string P and W 

Pa String concatenation of string P and symbol a: Pa = P o a 

P Watson-Crick reverse complement of DNA string P 



2.3 Strings and FM-index 

2.3.1 Strings with multiple sentinels Let S = {$, A, C, G, T, N} be 
the alphabet of DNA sequences with a predefined lexicographical order $ < 
A<C<G<T<N, where 'N' represents an ambiguous base and '$' is 
a sentinel that marks the end of a string. An element in £ is called a symbol 
and a sequence of symbols is called a string. Given a string T, let \T\ be the 
length of the string, T[i], i = 0, . . . , \T\ — 1, be the i-th symbol in the string, 
T[i,j], <i <j < |T|, be a s ubstring and Tj = T[i, \T\ - 1] be a suffix 
of T. Following the definition by Siren 12009), we define a string terminated 
with '$' as a text. A text may have multiple sentinals. In a text T, if T[i] = $ 
and T[j] = $, we mandate T[i] < T[j] if and only if i < j. Thus when 
we compare two suffixes of T, we do not need to compare beyond a sentinel 
because each sentinel has a different lexicographical rank. 

For two strings P and W, let P o W be their string concatenation. We 
may sometimes write P o W as PW if it is unambiguous in the context. 
Given an ordered set of texts, we call their ordered string concatenation as a 
collection, which is also a text. For example, suppose we have two reads. The 
first is ACG and the second is GTG. The collection of the two reads is T = 
ACG$GTG$. Suffix Tb < Tfe because the first seminal is lexicographically 
smaller than the second. 

For convenience, we assign an integer from to 5 to '$', 'A', 'C\ 
'G', 'T' and 'N', respectively. We may use both the integer and the letter 
representations throughout the article. In addition, given a symbol a, we 
define a as the Watson-Crick complement of a. We regard the complement 
of '$' and 'N' is identical to itself. 

2.3.2 FM-index The suffix array S of text T is a permutation of 
integers between and \T\ — 1, where S(i), < i < \T\, is the position 
of the i-th smallest suffix of T. Given a string P, the suffix array interval 
I l (P),I u (P)] of P in T is defined as 

rl {P) 



I 1 
I U {P) 



min{fc : P is the prefix of Tgi k \} 
max{fc : P is the prefix of Tgr k \} 



For convenience, we also define I B (P) = I U (P) — I l {P) + 1 as the size 
of the interval. 

The Burrows-Wheeler Transform IBurrows and Wheeleiill9941) . or BWT, 
of T is a permutation of symbols in T. The BWT string B is computed as 
B[i] = T[S(i) - 1] for S(i) > and B[i] = $ otherwise. Given a text T, 
also define the accumulative count array C(a) as the number of symbols in 
T that are lexicographically smaller than a, and the occurrence array 0(a, i) 
as the occurr ence of symbols a in B\0, i]. 

FM-index iFerrasina and Manzini, 2000) is a compressed representation 
of the BWT B, the occurrence array 0(a, i) and the suffix array S(i). The 



key property of FM-index is 

I l (aP) = 

I u (aP) = 

and I l {aP) < I u (aP) if and only if aP is a substring of T. We 
note t hat these two equatio ns are different from the ones in our previous 
paper iLi and Durbin. 2009J) in that C(a) and 0(a, i) defined here include 
the sentinels, but the two arrays in the previous paper exclude them. 

Given a collection T = QpQ i . . . Q n —i, we can r etrieve sequence Qi 
in linear time with Algorithm 1 JMakinen e t al., 2009). The second return 
value is the rank of Qi which equals \{Qj ■ Qj < Qi}\. 



Algorithm 1: Sequence retrieval 
Input: Sequence index i > 
Output: Sequence P and k, the rank of P 

Function GetSeq(i) begin 
k <— i 
P ^e 
while true do 

a <r- B[k] 

k <- C(a) + 0(a,k) -1 
if a = then 
[_ return (P,k) 
L P ^aP 



2.4 FMD-index 

Given DNA texts R ,...,R n -i, define T = RoRoR^Rt . . . R n _iP n _! 
as the bidirectional collection of the texts. We call the FM-index of T as the 
FMD-index of Rq, . . . , R n -i and define the bi-interval of a string P as 
[/' (P) , I (P) , I s (P)] . We will show how to compute the bi-interval of aP 
and Pa when we know the bi-interval of P. 

We note that when we know the bi-interval of P, I l (aP) and I s (aP) 
can be readily computed with Equation (T}. [I l (aP), I u (aP)] is a sub- 
interval of [P(P), I U (P)] because P is a prefix of aP = P o a. Due 
to the innate symmetry of T, I s (cP) = I s (cP) for all c g £ with 
J2c IS ( cP ) = /S ( P ) = IS (P)- We can compute I a (cP) for all c G S 
with Equation (TJ, use these interval sizes to divide [P(P), I U (P)] and 
finally derive [I l (aP), I u (aP)]. This completes the computation of the bi- 
interval of aP (Algorithm 2). Furthermore, when we backward extend P, 
we actually forward extend P. Conversely, backward extension of P yields 
forward extension of P (Algorithm 3). An F MD-index is bidire ctional. 

In comparison to the bidirectional BWT JLam et all 120091) which uses 
two FM-indices, the FMD-index builds both forward and reverse strand 
DNA sequences in one index. Although the FMD-index is not applicable to 
generic texts, it is conceptually more consistent with double-strand DNA and 
improves the speed of exact mat ching as we only need to search against one 
index. For example, BWA-SW iLi a nd Durbin, 20101) gets a 80% speedup 
when we adopt the FMD-index as the data structure. 

2.5 Unitig construction 

2.5.1 Labeling reads and overlaps Given a bidirectional collection 
T = RoRo ■ ■ ■ Rn-iRn-i, fermi labels the i-th input read Ri with an 
ordered integer pair [k,l], where k is the rank of Ri and I the rank of Ri. 
The pair [k,l] can be computed by GetSeq(2«) and GetSeq(2« + 1), 
respectively. We also call [k, I] as the bi-interval of read Ri. Obviously, the 
bi-interval of read Ri is [I , k] , with the two integer swapped. 

For two reads labeled by [k,l] and [k',1 1 ], if the tail (3'-end) of read 
[k, I] overlaps the head (5'-end) of [k', I'], we use an unordered integer pair 
(I, k 1 ) to label the overlap. Such is a tail-to-head overlap. Similarly, we use 
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Algorithm 2: Backward extension 



Input: Bi-interval [k, I, s] of string W and a symbol a 
Output: Bi-interval of string aW 

Function BACKWARDEXT([fc,Z, s],a) begin 
for b <— to 5 do 

k b «-C(6) + 0(6,fc-l) 

s b «- 0(6, k + s-1)- 0(b, k-l) 

lo *- I 

for b <- 3 to 1 do 



b+1 + s i)+l 



L '6 

'5 <- h + si 
return [fc a ,Z a ,s a ] 



Algorithm 3: Forward extension 



Input: Bi-interval [k, I, s] of string W and a symbol a 
Output: Bi-interval of string Wa 

Function FORWARDEXT([fc,Z, s],a) begin 
[/', k',s'] «— BACKWARDEXT([Z,fc,s],a) 
return [k',l',s'] 



(V , k) for a head-to-tail overlap, {1,1') for tail-to-tail and {k, k') for a head- 
to-head overlap. The four types of overlaps correspo nd to the four types of 
bidirectional edges in the bidirectional overlap graph (IMversLI 19951) . 

2.5.2 Finding irreducible overlaps Finding irreducible overlaps 
plays a central role in fermi as well as in SGA. Give n its importance, we 
prese nt a restructured version of this algorithm (SDlO lSimpson and Durbinl 
120101) using our notations (Algorithm 4). 

In Algorithm 4, line 1 computes the bi-interval of a single symbol. The 
loop at line 2 uses backward extensions to find all the reads overlapping with 
the input string P. The loop at line 3 uses forward extensions base by base to 
exclude reducible overlaps found at the previous step. W is this loop keeps 
the common substring of reads overlapping P extended from the 3 '-end of 
P. If in an iteration we find the sentinel of a read R (line 5), then all the 
reads sharing the same W with R must overlap with both R and P and 
therefore their overlaps with P are reducible. In this case, no further forward 
extensions are necessary (line 4 and 6). 

Similar to the original algorithm, Algorithm 4 requires that there are 
no contained reads. Fermi actually implements a modified version that 
detects reads containment on the fly, but we think the algorithm is a little 
overcomplicated. It is probably easier to filter contained reads first and then 
run Algorithm 4, as SGA does. 

2.5.3 Unitig construction Unitig construction is a process of 
unambiguous merge of overlapped reads. If [k,l] and [fc',Z'] have an 
irreducible overlap {I, k') and can be unambiguously merged, we label 
the merged sequence with [k,l'\; the similar can be applied to other three 
types of overlaps. With this simple labeling procedure, we are able to fully 
keep track of the graph topology during the unitig construction and without 
staging the graph in RAM. This procedure can also be easily multi-threaded. 

2.6 Finding the supermaximal exact matches 

An FMD-index can be used to find supermaximal exact matches (SMEMs) 
between a reference and a query sequence. Formally, a maximal exact match 
(MEM) is a an exact match that cannot be extended in either direction of the 
match. An SMEM is a MEM that is not contained in other MEMs on the 
query sequence. Fermi uses SMEMs to map reads back to the unitigs. 

Algorithm 5 describes the details. Basically, we use forward-backward 
extension to extend an exact match and detect the boundary of a maximal 



Algorithm 4: Finding irreducible overlaps (SD10) 
Input: Read P and the minimum overlap length x 
Output: Set of bi-intervals of reads having irreducible overlaps with the 
3 '-end of P 

Function lRROVERLAP(P,a;) begin 

Initialize Curr and Prev as empty arrays 

a-f- P[\P\ -1] 

[k, I, s] <- [C{a), C{a), C(a + 1) - C{a)] 
fori <- \P\ -2 toO do 
if \P\-i-l > a; then 

[k', I', s'] <-BACKWARDEXT([fc,Z> s], 0) 

if s' ^ then 
I Append Qk',l',s'],e) to Curr 

[jfe, I, s] <-BackwardExt([/c,Z, s], P[i]) 

Reverse array Curr, and swap Curr and Prev 
Finished = 
X = 

while Prev is not empty do 
Reset Curr to empty 
for {[k,l,s], W)inPrev do 
if W £ Finished then 
I continue 

[k', I', s'] <-FORWARDEXT([fc,Z, s], 0) 
if s' ^ then 

Finished <— Finished U {W} 

I<-lU{[fe',I',s']} 

continue 

for a <— 1 to 5 do 

[k',l',s'] <-FORWARDEXT([fc,Z, s],a) 
if s' 7^ and [k', V , s'] is not in Curr then 
I Append {[k', I', s'], Wa) to Curr 

Swap Curr and Prev 
return IrrOvlp 



match by tracking the change of interval sizes. Fermi implements a variant 
of Algorithm 5. It finds full-length read matches and can optionally exclude 
matches identical to the query sequence. 

2.7 Other implementation details 

2.7.1 Constructing FM-index To compute suffix arrays for strings 
with multiple se ntinels, we modifi ed an optimized implementation of the 
SA-IS algorithm JNong et aJTl201ll) by Yuta Mori. We used the established 
algorithm to merge BW Ts of subsets of reads faon et all [20071 1 Sirenl 120091 
iFerragina et all l2010fl . The BWT strin g is run-length encoded with the 
length in the delta encoding |Eliasj,[l975j). 

2.7.2 Error correction Fermi corrects potential sequencing errors 



using an algorithm similar to solving the spectrum alignment problem IPevzner et all 
120011) - corr ecting bases in u nderrepresented fc-mers. It also shares similarity 
to HiTEC illie et all 1201 lh . Nonetheless, the fermi's algorithm differs in 
that it is quality aware and does not replies on a user defined threshold on 
the fc-mer occurrences. 

Fermi corrects errors in two phases. In the first phase, it collect all 23- 
mer occurring 3 or more times using a top-down traversal over the trie 
represented by the FMD-index. For each such 23-mer, fermi counts the 
occurrences of the next (i.e. the 24-th) base and stores the information in 
a hash table with the 23-mer being the key. In the second phase, fermi 
processes each read by using the 23-mer hash table to correct errors by 
minimizing a heuristic cost function of base quality and the occurrences 
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Algorithm 5: Finding super-maximal exact matches 

Input: String P and stall position in; P[— 1] = 
Output: Set of bi-intervals of SMEMs overlapping io 

Function SuperMEM1(P,x) begin 

Initialize Curr, Prev and Match as empty arrays 

[k,l, s] <- [C(P[i ]),C(P\io\), C(P[i ] + 1) - C(P[i ] 
for i <— io + 1 to \P\ do 
ifi= P|then 
|_ Append [k, I, s] to Curr 

else 

[k\ V, s'] ^FORWARDEXT([fe,/, s], P{%\) 

its' ^ s then 

|_ Append [k, I, s] to Curr 

if s' = then 
break 

[k,l,s] <- [k',l',s'} 

Swap array Curr and Prev 

i' <- \P\ ' 

for j <— io — 1 to — 1 do 

Reset Curr to empty 

s" <- -1 

for [k,l, s] in Prev do 

[k 1 , I', S 1 ] <-BACKWARDEXT([fc,Z, s], P[l\) 

if s' = or i = —1 then 

if Curr is empty and i + 1 < i' + 1 then 

i' •«— j 

Append [k, I, s] to Match 

if s' ^ and s' 7^ s" then 

S" <- S' 

Append [k, I, s] to Curr 

if Curr is empty then 

L break 
Swap Curr and Prev 
return Match 



of the 24-th base. Roughly speaking, fermi tries to correct a low-quality 
base if by looking up its 23-mer prefix we know the base is different from 
an overwhelmingly frequent 24-th base. This algorithm can be adapted to 
correct INDEL sequencing errors in principle, but more works are needed to 
perform minimization efficiently. 

2.7.3 Simplifying complex bubbles A bubble is a directed acyclic 
subgraph with a single source and a single sink having at least two paths 
between the source and the sink. A closed bubble is a bubble with no 
incomming edges from or outgoing edges to other parts of the entire 
graph, except at the source and the sink vertices. A closed bubble is 
simple if there are exactly two paths between the source and the sink; 
otherwise it is complex. In de novo assembly, a bubble is frequently caused 
by sequencing errors or heterozygotes. Most short-read assemblers uses 
a modified Dijkstra's algorithm to pop bubbles progressively. Such an 
algorithm works fine for haploid genomes, but it is not straightforward to 
distinguish heterozygotes from errors when the bubble is complex. 

Fermi uses a different algorithm. It effectively performs topological 
sorting from the end of a vertex while keeping track of the top two paths 
containing most reads. A bubble is detected when every path ends at a single 
vertex. It then drops vertices not on the top two paths and thus turns a 
complex bubble to a simple one. 

2.7.4 Using the paired-end information Given paired-end reads 
with short-insert sizes, fermi maps reads back to the unitigs with 



Table 2. Statistics on human whole-genome assemblies 



AUPaths-LG fermi SGA SOAP HuRef 



Aligned contig bases 


2.62G 


2.82G 


2.74G 


2.71G 


2.88G 


Aligned N50 


22.6k 


15.6k 


9.8k 


7.0k 


81.4k 


Covered ref. bases 


2.59G 


2.74G 


2.70G 


2.67G 


2.78G 


# Type- 1 breaks 


13,738 


5,704 


6,049 


4,962 


16,318 


# Type-2 breaks 


3,823 


1,120 


1,735 


727 


6,626 



Contigs over 150bp in length are aligned to the human reference genome GRCh37 with 
BWA-SW using option '-b33 -q50 -rl7\ A type-1 break point is detected if a contig is 
split in alignment and mapped to two distict locations, and at each location the alignment 
is longer than 500bp and the mapping quality is no less than 10. Type-2 break points 
exclude type-1 break points which can be patched with gaps no longer than 500bp. 



Algorithm 5. If two unitigs are linked by at least five read pairs, fermi will 
locally assemble the ends of unitigs together with unpaired reads pointing to 
the gap under a relax setting. Fermi tries to align the ends of unitigs using 
the Smith-Waterman algorithm, which may reveal imperfect overlaps caused 
by sequencing errors or heterozygotes. Fermi also uses paired-end reads to 
break contigs at regions without bridging read pairs. This helps to reduce 
misassemblies during the unitig construction. 

3 RESULTS 

We ev aluated fermi on lOlbp paired-end reads from NA12878 (IDepristo et al.l 
120111) . The total coverage of the original data is about 70-fold, but 
we only used half of them. We assembled the 35-fold reads with 
fermi on a machine with 12 CPUs and 96GB memory in about 5 
days. The peak memory usage is 92GB. 

We obtained unitigs of N50 l,022bp, totaling 3.83Gb. After 
collapsing most heterozygotes and closing gaps with paired-end 
reads, we got longer contigs (Table 4). Unitigs are short and 
redundant mainly because they break at heterozygotes. 

For SNP and INDEL c alling, we aligned uni tigs to the reference 
genome using BWA-SW JLi and Durbiril201fj) with command line 
options '-b9 -ql6 -rl -w500\ We called SNPs with the SAMtools 
caller and INDELs by directly counting INDELs from the pileup 
output. We did not run a standard INDEL caller as short-read 
INDEL callers do not work well with long contig sequences. 

3.1 Performance on de novo assembly 

We obtained the ALLPATHS-LG NA12878 contigs from NCBI 
(AQAEKPOIOOOOOO ), the SGA and SOAPdenov o scaffolds from 
http://bit.l y/jtsl2878 Isimpson and Durbinl I2012T) . and the HuRef 
assembly ( [Levy et all 120070 for the comparison to the traditional 
capillary assembly. For the SGA and SOAPdenovo scaffolds, we 
split at any ambiguous bases to get contigs; for the HuRef assembly, 
we split at contiguous 'N' longer than 20bp. 

From Table 2, we can see that the HuRef assembly has much 
better contiguity than short-read assemblies. It contains more 
alignment break points when aligned against the GRCh37, but 
these are not necessarily all misassemblies. ALLPATHS-LG has 
longer N50 than fermi, SGA and SOAPdenovo partly because it 
uses 100-fold data and reads from jumping libraries. However, the 
ALLPATHS-LG assembly covers fewer reference bases and yields 
over twice as many as break points. Between fermi, SGA and 
SOAPdenovo which are applied on the same data set, fermi has 



Li 



Table 3. Evaluated SNP and INDEL call sets 



Table 5. Fraction of INDELs found in other call sets 



Label 


Data 


Assembler 


Mapper 


Caller 




MD 


CG 


BS 


CV 


FC 


MI 


ALL 


AC 


96X Illumina PE 1 


AUPaths-LG 


BWA-SW 2 


SAMtools 3 


MD 


240424 


0.819 


0.937 


0.678 


0.947 


0.054 


0.977 


BS 


70X Illumina PE 




BWA 4 


SAMtools 


CG 


0.752 


264696 


0.915 


0.629 


0.924 


0.052 


0.965 


CG 


Complete Genom. 




cgatools2 5 


cgatools2 


BS 


0.564 


0.597 


404646 


0.498 


0.844 


0.044 


0.906 


CV 


26X Illumina SE 6 


Cortex 




Cortex-var 


CV 


0.708 


0.726 


0.882 


251769 


0.902 


0.052 


0.923 


FC 


35X Illumina SE 6 


fermi 


BWA-SW 2 


SAMtools 3 


FC 


0.588 


0.624 


0.873 


0.522 


393841 


0.045 


0.952 


MD 


60X multiple 




MAQ 


lOOOg pilot 7 


MI 


0.593 


0.618 


0.790 


0.527 


0.804 


23216 


0.864 


MI 


Capillary reads 8 
35X Illumina SE 6 
























SS 




BWA-SW 


SAMtools 


INDELs that start 


within a homopolymer 


iin longer than 6bp are 


excluded 


in all call 












sets. An INDEL in 













AS uses reads from Illumina jumping and fosmid libraries 
" BWA-SW is invoked with 'bwa bwasw -b9 -ql6 -rl -w500' 
' INDELs are called fr om pileup without usi ng the SAMtools caller 
4 Realigned by GATK iDenristo et aTll201 lh als o around known INDELs 
° By Complete Genomics Drmanac et al.. 2010); only 'VQHIGLT calls retained 



CV, FC and SS do not use the pairing information in calling 

1000 Genomes Project pilot calls; generated from Dindel and multiple SNP callers 
8 INDEL calls bv lMills et all |201 lh 



longer N50 with similar number of alignment break points. Overall, 
fermi achieves comparable assembly quality to other assemblers. 

3.2 Performance on SNP and INDEL calling 

One of the key motivations of fermi is to explore the power of de 
novo assembly in calling short variants. We collected several SNP 
and INDEL call sets (Table 3) and compared the performance of 
fermi (Table 4 and 5). 

For SNP calling (Table 4), fermi misses 3% of SNPs called in 
SS, but finds more additional ones. Manual examination reveals 
that the additional calls are mainly caused by two factors. Firstly, 
in the single-end mode, BWA-SW is very conservative. It may 
consistently give a correct alignment a low mapping quality which 
are all downweighted by samtools. Fermi is able to assemble such 
reads into longer sequences which increase the power of BWA-SW. 
Secondly, in the fermi alignment, some regions may be mapped 
with a high mismatching rate. These may be due to small-scale 



Table 4. Statistics of SNP call sets 





FC 


CV 


SS 


BS 


CG 


MD 


#SNPs (M) 


3.37 


2.20 


3.24 


3.50 


3.34 


2.69 


#hets (M) 


1.97 


1.04 


1.94 


2.11 


2.04 


1.65 


ts/tv 


2.04 


2.03 


2.08 


2.11 


2.12 


2.06 


DN50 (bp) 


3,593 


6,662 


3,523 


3,392 


3,447 


3,992 


DN2/DN50 


22.3 


20.8 


23.4 


22.7 


22.3 


22.9 



Ts/tv is the transition-to-transversion ratio of SNPs. DN50 is calculated 
as follows. The reference genome is masked according to the align-ability 
mask i http://bit.ly/snpablej and segmented into intervals at heterozygous SNPs. DN50 
is computed such as 50% of unique positions in the genome are in intervals longer 
than DN50. DN2 is calculated similarly and D2/DN50 is the ratio of DN2 and 
DN50. DN50 measures the sensitivity; the smaller the better. DN2/DN50 measures 
the precision of heterozygous SNPs; the higher the better. 



by column) if there exists an INDEL in C such that the left-aligned starting positions of 
the two INDELs are within 20bp from each other. An INDEL in R is considered to be 
found in 'ALL' if it is found in one of the other INDEL sets in the table, plus the AC call 
set. In the table, a number on the diagonal equals | R\ , the number of INDEL calls in the 
call set. The fraction equals \{g £ R : g is found in C}|/|i?|. 



misassemblies in fermi unitigs or in the reference assembly, or copy- 
number variations. It is possible that these clustered SNPs contain 
more errors. Such errors may lead to reduced ts/tv, but tend not 
to break long homozygous blocks due to very recent coalescences. 
That is why FC has a high DN2/DN50 ratio, which measures how 
often false heterozygotes arise from a long homozygous block. 

Table 5 shows the comparison between different INDEL call 
sets. We excluded INDELs around long homopolymer runs in all 
call sets because INDEL sequencing errors tend to occur around 
long homopolymer runs and their error profile is still unclear (the 
1000 Genomes Project Analysis group, personal communication). 
In addition, we have excluded the SS INDEL call set which is nearly 
contained in BS due to the use of the same INDEL caller. 

For the call sets in Table 5, MD and CG are relatively small 
due to the use of very short reads. CV uses 26X lOObp reads. It 
is a sma ll call set due to t he high false negative rate of the calling 
method dlqbal et all 1 20 121) . The fermi call set FC is slightly smaller 
than BS, but it has larger overlap with other call sets than BS, and 
more FC calls are confirmed by others. One explanation to the lower 
overlapping ratio between BS and ALL is that BS is the only call 
set that uses lOlbp paired-end information, which gives it higher 
power for INDELs not detectable with single-end or very short 
reads. Nonetheless, purely based on Table 5, fermi appears to have 
higher overall accuracy. 

Even with all short-read call sets combined, a s many as 14% 
of double-hit INDELs called bv lMills et all feoill) are missed. We 
manually checked 30 missing INDELs in an alignment viewer. For 
half of the cases, the short-read alignment and fermi alignment 
strongly suggest no variations, and for all these cases, the HuRef 
sequences are identical to GRCh37. In addition, there are a few 
cases called from regions un der clear copy-num ber changes. In all, 
we believe INDELs called bv lMills et alj J20111) only may have high 
error rate. With short reads, we can recover most of short INDELs 
found by capillary sequencing. 



4 DISCUSSIONS 

In this article, we derived FMD-index by storing both forward 
and reverse complement DNA sequences in FM-index. This simple 
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modification ena bles faster forw ard-backward search than bi- 
directional BWT l ILam et all 120091) and makes FMD-index a more 
natural representation of DNA sequences. Based on FMD-index, we 
developed a new de novo assembler, fermi, which achieves similar 
quality to other mainstream assemblers. 

We demonstrated that it is possible to call SNPs and short 
INDELs by aligning assembled unitigs to the reference genome. 
This approach has similar SNP accuracy to the standard mapping 
based SNP calling and arguably outperforms the existing methods 
on INDEL calling in terms of both sensitivity and precision. 
Assembly based variant calling is a practical and beneficial 
complement to mapping based calling. 

In the course of evaluating INDEL accuracy, we found that 
outside long homopolymer regions, INDEL call sets do not often 
contain false positives, but they may have high false negative 
rate, which leads to the apparent small overlap between call 
sets ( lLam et alll2012h . 

As a theoretical remark, we note that with read counts kept, 
unitigs are a lossless but reduced representation of sequence reads. 
They are 'reduced' in that individual reads are lost; they are 
'lossless' in that all the information in reads, such as small variants, 
copy numbers and structural changes are fully preserved in unitigs, 
as long as they are constructed correctly. For single-end reads, 
it is theoretically possible to 'compress' reads to unitigs, which 
are largely non-redundant and much smaller in size. Accurately 
and efficiently constructing unitigs might provide an interesting 
alternative to data storage and downstream analyses in future, 
though practical challenges, such as the high computational cost and 
the lack of accuracy of unitigs, remain at present. 
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